Closed Bug 1810951 Opened 2 years ago Closed 2 years ago

Implement Glean-based crash pings in Fenix

Categories

(Firefox for Android :: Experimentation and Telemetry, enhancement)

All
Android
enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: gsvelto, Assigned: afranchuk)

References

()

Details

Attachments

(1 file)

As per title, the idea is to implement the minimal crash ping introduced in bug 1790569. While the ping will have only a small amount of functionality it will allow us to gauge Fenix actual crash rate. It will also be improved in lockstep with the desktop version so that we maintain feature equality between the two.

Assignee: nobody → afranchuk

I'm becoming familiar with the Firefox for Android code, and it seems like there's a number of possible scopes for this change.

  1. GeckoView: adding crash pings here may best achieve the desired outcome, depending on how you look at it. We could get crash pings from anyone using GeckoView (which may be projects other than Fenix). On one hand, this would cover all Gecko crashes, but on the other, some people using GeckoView may not like telemetry being built-in. However, it could be opt-in. I expect the only crashes that could be reported would be non-fatal ones.

    It's worth noting that if we want to capture other (non-Gecko) crashes in Fenix, maybe those could be reported separately (e.g. as a separate crash "product" than GeckoView), which may lend itself better to debugging/metrics tracking.

    geckoview doesn't currently contain Glean, however geckoview-omni does.

  2. Android Components lib-crash: the crash library already supports Sentry and Socorro as services, so adding a Glean service here would be appropriate. This is also where the crash count Glean metric is already integrated.

    One might consider this equivalent to (1) in some regards, since an app writer (and Fenix) may opt-in by using lib-crash (as long as they're aware it exists).

  3. Fenix: adding crash pings directly to Fenix would be fairly straightforward. The codebase already uses Glean, sends a number of pings, and has existing logic managing crash events. However, currently crash reporting in Fenix is done through (2), so putting it directly in the Fenix code is likely not the best approach.

My take on this is that 2. is the best choice. Crash pings are per-product so they should not be in GeckoView as that's just the foundation for building a browser (or an app that's not a browser!). Android Components is a good fit because it already has all the plumbing in place and is used by Fenix - but other apps based on GeckoView won't necessarily use it which is fine. That being said the final choice rests on the Fenix team so we'll have to ask for some feedback directly from them.

Okay, I expect they'll agree with your opinion, seeing as that's where all the other crash ping/reporting is handled. I wasn't sure whether GeckoView itself is viewed as an individual product, which is why I brought up the possibility of getting crash information directly from it.

I suppose if someone were using GeckoView and experiencing crashes, our first response (assuming they filed a bug) might be to advise them to add lib-crash so that we can get a better idea of what is going on at the moment of the crash (if it isn't clear enough already).

There already is a Glean crash reporting service defined in lib-crash that handles the messier parts of making things work with Glean running in another process.

I would love to see the lib-crash Glean stuff expanded and/or modified to suit the purposes of this bug, please do let me know if I may be of assistance.

I saw that in the README; I'm glad there's some precedent and existing code to handle that edge case. It seems like that's the most appropriate place for it.

Depends on: 1811928

I've temporarily copied the {metrics,pings}.yaml files into my local copy (in lieu of bug 1811928) and have updated the GleanCrashReporterService to persist/send the crash pings. I'm wondering whether this obsoletes the crash_metrics.crash_count metric: is there any advantage to having the labeled_counter when you could presumably aggregate the same results amongst submitted crash pings? Though worth noting: the crash pings currently don't have a metric which encompasses uncaught exceptions. Would we want to classify that as a process_type (I'd say that's technically correct), or should that be reserved for gecko process types and we add another metric for this sort of distinction?

The work so far is nominal, since I can't get the glean parser to run correctly (but there are no other language check errors so it should compile after that is fixed). I see

Execution failed for task ':lib-crash:gleanGenerateMetricsDocsForDebug'.
> Glean documentation generation failed.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

which I'll dig into further if I can figure out where to poke the gradle build files to pass those suggested flags.

I'd keep crash_count for a least a couple releases to ensure there's agreement. We might find some differences between measuring things two ways (well, we always do, but we might find meaningful differences) that we might've missed without something to compare against.

This is implemented and there is a PR open.

Severity: -- → N/A
Depends on: 1815432

The PR has landed.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
See Also: → 1822148
Blocks: 1904847
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: